(54) A system for automatically locating video segment boundaries and for extraction of key-frames



(57) The present invention describes an automatic video content parser for parsing video shots such that they are represented in their native media and retrievable based on their visual contents. This system provides methods for temporal segmentation of video sequences into individual camera shots using a novel twin-comparison method. The method is capable of detecting both camera shots implemented by sharp break and gradual transitions implemented by special editing techniques, including dissolve, wipe, fade-in and fade-out. The system also provides content based key frame selection of individual shots by analysing the temporal variation of video content and selecting a key frame once the difference of content between the current frame and a preceding selected key frame exceeds a set of preselected thresholds.
Description



The present invention relates to video indexing, archiving, editing and production, and, more particularly, this invention teaches a system for parsing video content automatically.

In today's information world, the importance of acquiring the right information quickly is an essential aspect we all face. Given so much information and so many databases to process, how to make full use of technological advancement to our advantage is what the present invention addresses. Information about many major aspects of the world is presented in a time-varying manner, such as video sources. However, the effective use of video sources is seriously limited by a lack of viable systems that enable easy and effective retrieval of information from these sources. Also, the time-dependent nature of video makes it a very difficult medium to manage. Much of the vast quantity of video containing valuable information remains unindexed. This is because indexing requires an operator to view the entire video package and to assign index means manually to each of its scenes. This approach is not feasible considering the abundance of unindexed videos and the lack of sufficient manpower and time. Moreover, without an index, information retrieval from video requires an operator to view the source during a sequential scan, but this process is slow and unreliable, particularly when compared with analogous retrieval techniques based on text. Therefore, there is clearly a need to present a video package in very much the same way as a book with an index structure and a table of contents.

Prior art teaches segmentation algorithms to detect camera breaks, but no method detects gradual transitions implemented by special editing techniques including dissolve, wipe, fade-in and fade-out. Prior art such as "Automatic Video Indexing and Full-Video Search for Appearances," Proc. 2nd Working Conf. on Visual Database Systems, Budapest, 1991, pp. 119-133 by A. Nagasaka and Y. Tanaka, and "Video Handling Based on Structured Information for Hypermedia Systems," Proc. Int'l Conf. on Multimedia Information Systems, Singapore, 1991, pp. 333-344 by Y. Tonomura, teach segment boundary detection methods, but only ones capable of detecting sharp camera breaks. Another important area of video indexing is to select a representative frame known as a Key Frame. This selection process as taught by prior art is based on motion analysis of shots and is complicated and prone to noise.

It is therefore an object of the present invention to automate temporal segmentation (or partitioning) of video sequences into individual camera shots by distinguishing between sharp breaks and gradual transitions implemented by special effects. Such partitioning is the first step in video content indexing, which is currently being carried out manually by an operator, a time consuming, tedious and unreliable process.

It is another object of the present invention to provide content based key frame selection of individual shots for representing, indexing and retrieval of the shots in a multimedia manner.

Accordingly, the present invention describes an automatic video content parser for parsing video shots such that they are represented in their native media and retrievable based on their visual contents. This system provides methods for temporal segmentation of video sequences into individual camera shots using a novel twin-comparison method. The method is capable of detecting both camera shots implemented by sharp break and gradual transitions implemented by special editing techniques, including dissolve, wipe, fade-in and fade-out. The system also provides content based key frame selection of individual shots by analysing the temporal variation of video content, and selects a key frame once the difference of content between the current frame and a preceding selected key frame exceeds a set of preselected thresholds.

For a better understanding of the present invention and to show how the same may be carried into effect, reference will now be made by way of example to the accompanying drawings in which:

FIG. 1 A shows a flow chart of a known video segmentation method. 

FIG. 1 B shows an example of a typical sharp cut in a video sequence. 

FIG. 2 is a block diagram showing an overview of a video content parsing system of the present invention.

FIG. 3A shows a flow chart for performing temporal segmentation capable of both detecting sharp cuts and gradual transitions implemented by special effects of the present invention.

FIG. 3B shows an example of a typical dissolve transition found in a video sequence.

FIG. 3C shows an example of the frame-to-frame difference values with high peaks corresponding to sharp cuts and a sequence of medium peaks corresponding to a typical dissolve sequence.

FIG. 4 shows a flow chart for extracting key frames of the present invention.

FIG. 5 shows a flow chart of a system for automatically segmenting video and extracting key frames according to the present invention.

Various terminologies are used in the description of this present invention. Accordingly, for better understanding of the invention, definitions of some of these terminologies are given as follows.

A "segment" or a "cut" is a single, uninterrupted camera shot, consisting of one or more frames.

A "transition" between segments may be implemented as a sharp break (which occurs between two frames belonging to two different shots), or as special effects such as dissolve, wipe, fade-in and fade-out, in which case the transition occurs across more than one frame.

"Temporal segmentation" or "scene change detection" of a video is to partition a given video sequence temporally into individual segments by finding the boundary positions between successive shots.

"Key frames" are those video frames which are extracted from a video segment to represent the content of the video or segment in a video database. The number of key frames extracted from a video sequence is always smaller than the number of frames in the video sequence. Thus, a sequence of key frames is considered an abstraction of the video sequence from which they are extracted.

"Threshold" is a limiter used as a reference for checking whether a certain property has satisfied a certain criterion, which is dictated by this limiter to define a boundary of a property. The value of threshold t, used in pair-wise comparison for judging whether a pixel or super-pixel has changed across successive frames, is determined experimentally and does not change significantly for different video sources. However, experiments have shown that the thresholds T_b and T_t used for determining a segment boundary using any of the difference metrics (as defined below) vary from one video source to another.

"Difference metrics" are mathematical equations, or modifications thereof, adapted for analysing the properties of, in this case, video content. The different difference metrics include:

• Pair-wise pixel comparison, whereby a pixel is judged as changed if the difference between the intensity values in the two frames exceeds a given threshold t. This metric may be represented as a binary function DP_i(k,l) over the domain of two-dimensional coordinates of pixels, (k,l), where the subscript i denotes the index of the frame being compared with its successor. If P_i(k,l) denotes the intensity value of the pixel at coordinate (k,l) in frame i, then DP_i(k,l) may be defined as:

\[ DP_i(k,l) = \begin{cases} 1 & \text{if } |P_i(k,l) - P_{i+1}(k,l)| > t \\ 0 & \text{otherwise} \end{cases} \]



The pair-wise comparison algorithm simply counts the number of pixels changed from one frame to the next according to the above metric. A segment boundary is declared if more than a given percentage of the total number of pixels (given as a threshold T) have changed. Since the total number of pixels in a frame of dimension M by N is M*N, this condition may be represented by the following inequality:

\[ \frac{\sum_{k,l=1}^{M,N} DP_i(k,l)}{M \times N} \times 100 > T \]
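
The pair-wise test above can be illustrated with a minimal Python sketch. The function names, and the representation of a frame as a list of rows of integer intensities, are assumptions made only for this illustration and are not taken from the specification.

```python
def pairwise_changed_percent(frame_a, frame_b, t):
    """Percentage of pixels whose intensity differs by more than t."""
    rows, cols = len(frame_a), len(frame_a[0])
    changed = sum(
        1
        for k in range(rows)
        for l in range(cols)
        if abs(frame_a[k][l] - frame_b[k][l]) > t  # DP_i(k,l) = 1
    )
    return 100.0 * changed / (rows * cols)


def is_boundary_pairwise(frame_a, frame_b, t, T):
    """Declare a segment boundary when more than T percent of pixels changed."""
    return pairwise_changed_percent(frame_a, frame_b, t) > T
```

For example, with a 2 by 2 frame in which two of the four pixels change by more than t, the percentage is 50, so a boundary is declared for T = 40 but not for T = 60.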

• Likelihood ratio is a comparison of corresponding regions or blocks in two successive frames based on the second-order statistical characteristics of their intensity values. Let m_i and m_{i+1} denote the mean intensity values for a given region in two consecutive frames, and let S_i and S_{i+1} denote the corresponding variances. The ratio is defined as:

\[ \frac{\left[ \dfrac{S_i + S_{i+1}}{2} + \left( \dfrac{m_i - m_{i+1}}{2} \right)^2 \right]^2}{S_i \, S_{i+1}} > t \]



A segment boundary is declared whenever the total number of sample areas whose likelihood ratio exceeds 
the threshold t is sufficiently large (where "sufficiently large" depends on how the frame is partitioned). This metric 
is more immune to small and slow object motion moving from frame to frame than the preceding difference metrics 
and therefore less likely to be misinterpreted as camera breaks. 
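
A minimal Python sketch of the likelihood ratio for one region follows. It assumes the commonly used form of the ratio, ((S_i + S_{i+1})/2 + ((m_i - m_{i+1})/2)^2)^2 / (S_i * S_{i+1}), and assumes a region is given as a flat list of intensities. The function names and the handling of zero variance are choices made for this illustration only.

```python
def mean_and_variance(block):
    """Mean and (population) variance of a flat list of intensities."""
    n = len(block)
    m = sum(block) / n
    s = sum((v - m) ** 2 for v in block) / n
    return m, s


def likelihood_ratio(block_a, block_b):
    """Likelihood ratio between corresponding regions of two frames."""
    m1, s1 = mean_and_variance(block_a)
    m2, s2 = mean_and_variance(block_b)
    numerator = ((s1 + s2) / 2 + ((m1 - m2) / 2) ** 2) ** 2
    if s1 * s2 == 0:
        # Assumption for this sketch: flat regions are treated as identical
        # (ratio 1) when nothing differs, and as maximally different otherwise.
        return 1.0 if numerator == 0 else float("inf")
    return numerator / (s1 * s2)
```

Identical regions give a ratio of 1, and the ratio grows as the means or variances of the two regions diverge. A region is counted as changed when its ratio exceeds the threshold t.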

• Histogram comparison is yet another algorithm that is less sensitive to object motion, since it ignores the spatial changes in a frame. Let H_i(j) denote the histogram value for the ith frame, where j is one of the G possible pixel values. The overall difference SD_i is given by:

\[ SD_i = \sum_{j=1}^{G} \left| H_i(j) - H_{i+1}(j) \right| \]

A segment boundary is declared once SD_i is greater than a given threshold T.
The χ² test is a modified version of the above immediate equation which makes the histogram comparison reflect the difference between two frames more strongly.
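
The histogram comparison can be sketched in Python as below. Frames are assumed to be lists of rows of integer pixel values in the range 0 to G-1. The chi-square style variant shown, the sum over bins of (H_i(j) - H_{i+1}(j))^2 / H_{i+1}(j), is one common form and is an assumption here, since the exact modified equation is not legible in this text. All function names are hypothetical.

```python
def histogram(frame, levels):
    """Count how many pixels take each of the `levels` possible values."""
    h = [0] * levels
    for row in frame:
        for v in row:
            h[v] += 1
    return h


def histogram_difference(h_a, h_b):
    """Sum of absolute bin-wise differences (SD_i)."""
    return sum(abs(a - b) for a, b in zip(h_a, h_b))


def chi_square_difference(h_a, h_b):
    """Chi-square style variant; bins that are empty in h_b are skipped."""
    return sum((a - b) ** 2 / b for a, b in zip(h_a, h_b) if b > 0)
```

Two frames that contain the same pixel values in different positions have identical histograms and therefore a difference of zero, which is why this metric is less sensitive to object motion.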

The above metrics may be implemented with different modifications to accommodate the idiosyncrasies of different video sources. Unlike the prior art, the video segmentation of the present invention does not limit itself to any particular difference metric and a single specified threshold. As such it is versatile.

FIG. 1A shows a flow chart of a known video segmentation method. This algorithm detects sharp cuts in video sequences. After the standard system parameters are initialized at block 101, a first reference frame is selected. The difference, D_i, between frames F_i and F_{i+S} (that is, frames a skip factor S apart) is calculated based on a selected difference metric at block 104. If D_i is greater than a preset threshold, a change of shot is detected and recorded at block 106 before proceeding to block 108. Otherwise, it proceeds directly to block 109 to load a new frame. If it is the last frame, the segmentation process is completed; otherwise, it proceeds to block 103 to repeat the process until the last frame of the video sequence is reached. The output is a list of frame numbers and/or time codes of the starting and ending frames of each shot detected from the input video sequence. This method is suitable for use in detecting sharp transitions between camera shots in a video or film such as that depicted in FIG. 1B. The contents of shot 110 and shot 112 are completely different from one another.
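
The known single-threshold method can be summarised with a short Python sketch. It assumes the frames are already decoded into a list and that `difference` is any of the difference metrics defined above. The function and parameter names are illustrative only.

```python
def detect_sharp_cuts(frames, difference, threshold, skip=1):
    """Return indices of frames at which a sharp cut is detected.

    Each frame i is compared with frame i + skip; a cut is recorded at
    i + skip when the difference exceeds the single preset threshold.
    """
    cuts = []
    i = 0
    while i + skip < len(frames):
        if difference(frames[i], frames[i + skip]) > threshold:
            cuts.append(i + skip)
        i += skip
    return cuts
```

Because only one threshold is used, a gradual transition whose frame-to-frame differences all stay below that threshold goes undetected.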

FIG. 2 is a block diagram showing an overview of a video content parsing system of the present invention. The system comprises a computer 202, containing Program for Interface block 210, Data Processing Modules block 212 for carrying out the preferred embodiments of the present invention, and block 214 for logging the information associated with segment boundaries and key frames. Connected to the computer are: User Interface Devices 204, an optional storage device and an input video source 208. The input video data can be either analog or digital, and can be on any type of storage medium.

A. First Embodiment of Invention 



FIG. 3A shows a flow chart for performing temporal segmentation capable of both detecting sharp cuts and gradual transitions implemented by special effects of the first embodiment of the present invention. This embodiment is different from the above described video segmentation algorithm of FIG. 1A. It detects sophisticated transition techniques including dissolve, wipe, fade-in and fade-out by using a novel twin-comparison method to find the starting and ending points of the transitions. Prior art employs a histogram difference metric to calculate the difference value between frames and detects a scene change whenever the result is greater than a specified threshold. In contrast, the present invention does not limit itself to the use of any particular difference metric and a single specified threshold. The novel twin-comparison method uses more than one threshold to detect both sharp camera breaks and gradual transitions. It introduces another difference comparison, namely, accumulated difference comparison, as depicted in FIG. 3A.

Referring again to FIG. 3A, after initialization of system parameters at block 302 and loading of frame i (the current frame), the detection process begins by comparing the previous frame i-S and the current frame i, calculating the difference, D_i, based on a selected difference metric. If D_i is greater than the shot break threshold, T_b, and Trans is false (that is, not in a transition period yet), then it proceeds to block 308; otherwise, it goes on to the check against the larger threshold described below. At block 308, if Σ_F, the frame count in a shot, is not greater than N_smin, the minimum number of frames for a shot, then Σ_F is incremented by S, the temporal skip factor between two frames being compared in the detection, before continuing to process the remaining frames in the video sequence at block 336; otherwise a cut is declared at point P1 and a shot starting at frame F_s and ending at frame F_e is recorded at block 310, followed by reinitialization of the start of a new shot, resetting the frame count Σ_F to zero and setting Trans to false at block 312 before proceeding to block 336. If the first condition is not met, D_i is checked against a larger threshold αT_b, where α is a user tunable parameter. If D_i is greater and Trans is true, then a confirmed cut is declared at point P1; otherwise, if Trans is found to be false, that is, not in a transition period, D_i is compared against a predefined transition break threshold, T_t, at block 330. If D_i is greater than T_t and the frame count in a shot, Σ_F, is greater than the minimum number of frames for a shot, N_smin, at block 332, a potential transition is found at P2, and Trans is duly set to true and the other relevant parameters are updated; otherwise, Σ_F is incremented by S before continuing to process the remaining frames in the video sequence. If Trans is true and D_i is not less than T_t, then Σ_F (frame count in a shot), Σ_ta (accumulative differences in a transition sequence) and Σ_tF (frame count in a transition process) are updated at block 317 before continuing to process the remaining frames in the video sequence; otherwise, it proceeds to calculate D_a, the difference between current frame i and the start frame of the potential transition, F_p, that is, between the image feature of the current frame, f_i, and the image feature of the potential transition frame, f_p.

EP 0 690 413 A2

At block 318, if D_a is not greater than T_b or Σ_ta/Σ_tF is not greater than γT_t (where γ ≥ 1), it goes to block 320; otherwise, the end of a transition has been successfully detected at point P3. Accordingly, a transition starting at F_s = F_p and ending at F_e is declared and recorded at block 322 before reinitializing relevant parameters at blocks 324 and 326, and the process then continues with the next frame in the video sequence. At block 320, if the failed frame count for the period of transition, Σ_tm, is not less than the maximum allowable number of fails in a transition, N_tmmax, then the current transition is found to be false and deemed failed at block 328 at point P4; otherwise, Σ_tm, Σ_ta and Σ_tF are updated, that is, the process is still in a transition period, before proceeding to block 336. Here, the position of the current frame is incremented by a pre-defined skip factor, S, to give the next frame for processing; and if the end of the sequence has not been reached, the process is repeated from block 304.
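
The core of the twin-comparison idea can be sketched in Python as follows. This is a deliberately simplified illustration and not the flow chart of FIG. 3A: it omits the minimum shot length N_smin, the skip factor S, the αT_b check and the average-difference test, and it assumes that `difference` is any of the metrics defined earlier. The names `twin_comparison`, `t_b`, `t_t` and `max_fails` are hypothetical.

```python
def twin_comparison(frames, difference, t_b, t_t, max_fails=2):
    """Detect sharp cuts and gradual transitions with two thresholds.

    Returns ("cut", i) for a sharp break at frame i and
    ("transition", start, end) for a gradual transition.
    """
    events = []
    start = None  # frame index where a potential transition began
    fails = 0     # low-difference frames seen inside the candidate
    for i in range(1, len(frames)):
        d = difference(frames[i - 1], frames[i])
        if start is None:
            if d > t_b:
                events.append(("cut", i))       # sharp break
            elif d > t_t:
                start, fails = i - 1, 0          # potential transition starts
            continue
        if d >= t_t:
            fails = 0                            # still inside the candidate
            continue
        # The frame-to-frame difference dropped below T_t: compare the
        # current frame with the frame at which the candidate started.
        if difference(frames[start], frames[i]) > t_b:
            events.append(("transition", start, i - 1))
            start = None
        else:
            fails += 1
            if fails > max_fails:
                start = None                     # false transition, discard
    return events
```

With scalar stand-ins for frames, the sequence 0, 0, 0, 2, 4, 6, 8, 10, 10, 10, 30, 30 yields one gradual transition (no single step exceeds T_b = 8, but the accumulated change does) followed by one sharp cut.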

The flow chart as depicted in FIG. 3A is capable of detecting gradual transitions such as a dissolve as shown in FIG. 3B. The original scene 340 is slowly superimposed by another scene at 342, 344 and 346 and is finally fully dominated by that new scene 348. Effectively, the task of the algorithm as described above is to detect the first and last frames of the transition sequences in a video sequence. FIG. 3C depicts an example of the frame-to-frame difference values, showing high peaks corresponding to sharp cuts at 350 and 352, and a sequence of medium peaks corresponding to a typical dissolve sequence 354.

The single-pass approach depicted in FIG. 3A has the disadvantage of not exploiting any information other than the threshold values; thus, this approach depends heavily on the selection of those values. Also, the processing speed is slow. A straightforward approach to reducing processing time is to lower the resolution of the comparison, that is, to examine only a subset of the total number of pixels in each frame. However, this is clearly risky, since if the subset is too small, the loss of spatial detail (if examining in the spatial domain) may result in a failure to detect certain segment boundaries. A further improvement could be achieved by employing a novel multiple-pass approach that provides both high speed processing (the amount of improvement depends on the size of the skip factor and the number of passes) and accuracy of the same order in detecting segment boundaries. In the first pass, resolution is sacrificed temporally to detect potential segment boundaries with high speed. That is, a "skip factor", S, is introduced in the video segmentation process. The larger the skip factor, the lower the resolution. (Note that this skip factor is in general larger than the one used in the description of the above and below flow charts.) For instance, a skip factor of 10 means examining only one out of 10 frames from the input video sequence, hence reducing the number of comparisons (and, therefore, the associated processing time) by the same factor. In this pass the twin comparison for gradual transitions as described in FIG. 3A is not applied. Instead, a lower value of T_b is used, and all frames having a difference larger than T_b are detected as potential segment boundaries. Due to the lower threshold and large skip factor, both camera breaks and gradual transitions, as well as some artifacts due to camera or object movement, will be detected. False detections that fall under the threshold will also be admitted, as long as no real boundaries are missed. In the second pass, all computations are restricted to the vicinity of these potential boundaries and the twin-comparison is applied. Increased temporal (and spatial) resolution is used to locate all boundaries (both camera breaks and gradual transitions) more accurately, thus recovering from the low accuracy with which the first pass locates the potential shot boundaries. Another feature that can be implemented in the multiple-pass method is an option whereby different difference metrics may be applied in different passes to increase confidence in the results.
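
A two-pass version of this idea might look like the Python sketch below. For brevity the second pass applies only a plain threshold test to every frame pair inside each candidate interval; in the method described above the twin-comparison would be applied there instead. The function names and the interval representation are assumptions made for this illustration.

```python
def coarse_pass(frames, difference, low_threshold, skip):
    """First pass: return (lo, hi) frame intervals that may hide a boundary."""
    candidates = []
    last = len(frames) - 1
    i = 0
    while i < last:
        j = min(i + skip, last)  # the final interval may be shorter
        if difference(frames[i], frames[j]) > low_threshold:
            candidates.append((i, j))
        i = j
    return candidates


def fine_pass(frames, difference, t_b, candidates):
    """Second pass: examine every frame pair inside each candidate interval."""
    cuts = []
    for lo, hi in candidates:
        for i in range(lo + 1, hi + 1):
            if difference(frames[i - 1], frames[i]) > t_b:
                cuts.append(i)
    return cuts
```

With a skip factor of 5 on a 12-frame sequence, the first pass performs 3 comparisons instead of 11, and the second pass only revisits the interval flagged as a candidate.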

B. Second Embodiment of Invention 

The second embodiment of the present invention pertains to the determination of the threshold values used for determining segment boundaries, in particular thresholds T_b, the shot break threshold, and T_t, the transition break threshold. Selecting these threshold values is of utmost importance because it has been shown that they vary from one video source to another. A "tight" threshold makes it difficult for false transitions to be falsely accepted by the system, but at the risk of falsely rejecting true transitions. Conversely, a "loose" threshold enables transitions to be accepted consistently, but at the risk of falsely accepting false transitions.

In this invention, the automatic selection of threshold T_b is based on statistics of frame-to-frame differences over an entire or part of a given video sequence. Assuming that there is no camera shot change or camera movement in a video sequence, the frame-to-frame difference value can only be due to three sources of noise: noise from digitizing the original analog video signal, noise introduced by video production equipment, and noise resulting from the physical fact that few objects remain perfectly still. All three sources of noise may be assumed to be random. Thus, the distribution of frame-to-frame differences can be decomposed into a sum of two parts: the random noises and the differences introduced by shot cuts and gradual transitions. Differences due to noise do not relate to transitions. So T_b is defined as T_b = μ + aσ, where σ is the standard deviation, μ is the mean difference from frame to frame and a is a tunable parameter with a > 2. The other threshold, namely T_t, is used for detecting gradual transitions and is defined as T_t = bT_b, where b is between 0.1 and 0.5, according to experiments conducted.
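
The threshold selection can be written in a few lines of Python. The sketch assumes the frame-to-frame differences have already been computed for the whole sequence (or a sample of it) and uses the population standard deviation. The default values a = 5.0 and b = 0.3 are illustrative picks within the ranges given in the text and claims, not values prescribed by the specification.

```python
from statistics import mean, pstdev


def select_thresholds(differences, a=5.0, b=0.3):
    """Return (T_b, T_t) with T_b = mu + a * sigma and T_t = b * T_b."""
    mu = mean(differences)
    sigma = pstdev(differences)
    t_b = mu + a * sigma
    t_t = b * t_b
    return t_b, t_t
```

For the differences 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the standard deviation is 2, so a = 3 and b = 0.5 give T_b = 11 and T_t = 5.5.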

C. Third Embodiment of Invention 

FIG. 4 depicts the third embodiment of the present invention, an automatic content based key frame extraction method using temporal variation of video content. After initializing system parameters at block 502, the difference, D_i, between frames i-S and i is calculated and accumulated based on a selected difference metric at block 506. If this accumulated difference, Σ_k, exceeds T_k, the threshold for a potential key frame, then the Flag is set to 1 (a potential key frame has been detected) at block 509 and the process proceeds to block 510, where further verification is made by calculating D_a, the difference between the current frame i and the last key frame recorded, F_k, based on a selected difference metric. If, at block 511, D_a is greater than T_d, the threshold for a key frame, then, at block 512, the current frame, F_i, is recorded as a current key frame, and F_k is reinitialized as the current frame, Σ_k to zero, and f_k, the image feature of the previous key frame, as the current image feature, before the process is repeated from block 504 if the end of the video sequence has not been reached. Otherwise, if, at block 511, D_a is not greater than T_d, then it proceeds to analyse the next frame.
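
The two-stage test can be sketched in Python as follows. The sketch assumes that the first frame of the sequence serves as the reference until a key frame has been found, as stated in claim 1(d), and that `difference` is any of the metrics defined earlier. The function and parameter names are hypothetical.

```python
def extract_key_frames(frames, difference, t_k, t_d, skip=1):
    """Return indices of key frames selected by temporal content variation."""
    key_frames = []
    reference = 0      # first frame until a key frame has been recorded
    accumulated = 0.0  # sum of D_i since the last key frame
    for i in range(skip, len(frames), skip):
        accumulated += difference(frames[i - skip], frames[i])
        if accumulated > t_k:
            # Potential key frame: verify against the last key frame.
            if difference(frames[reference], frames[i]) > t_d:
                key_frames.append(i)
                reference = i
                accumulated = 0.0
    return key_frames
```

The verification step matters when content oscillates: the accumulated difference may exceed T_k even though the current frame still resembles the reference, in which case no key frame is recorded.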

The key frame extraction method as described in FIG. 4 is different from prior art. Prior art uses motion analysis, which depends heavily on tracing the positions and sizes of the objects being investigated using mathematical functions to extract a key frame. That method is not only too slow but also impractical, since it relies on accurate motion field detection and complicated image warping. The present invention, by contrast, extracts key frames purely based on the temporal variation of the video content as described in FIG. 4. These key frames can then be used in video content indexing and retrieval.

OPERATION OF THE PREFERRED EMBODIMENTS 

FIG. 5 shows a flow chart of an operation of the preferred embodiments for a video parser of FIG. 2. After selecting the difference metric(s), and therefore the image feature(s) to be used in frame content comparison, by user or by default, as well as all the required system parameters at block 602, the system loads a video sequence into the parser and digitizes it if this has not already been done. Then, the difference, D_i, between consecutive video frames is calculated based on the selected difference metric(s) at block 608. If D_i exceeds the shot break threshold, T_b, and Trans is false (that is, not in a transition period) and the frame count in a shot, Σ_F, is greater than the minimum preset number of frames for a shot, N_smin, a cut is declared at point P1 and a shot, starting at frame F_s and ending at frame F_e, is recorded; and the detection process continues to process the following frames of the video sequence. However, if, at block 612, Σ_F is not greater than N_smin, then a key frame is recorded at block 614. At block 610, if the conditions are not true and, at block 618, the conditions are still not true, then a further check is required at block 620. At this check, if Trans is not true and, at block 646, D_i is not greater than T_t, then it proceeds to block 640 where Σ_k, the accumulative differences after the previous key frame, is incremented by D_i, and proceeds to block 641. Here, if the accumulative differences are not greater than T_k, it goes to block 638; otherwise it proceeds to block 650 to calculate D_a, the difference between features f_i and f_p. If, at block 644, the frame count in a shot, Σ_F, is greater than the minimum number of frames allowable for a single shot, N_smin, a potential transition is found at P2 and accordingly Trans is set to true and other relevant parameters are updated at block 642; otherwise, it proceeds to block 640. Coming back to block 652, if D_a is greater than δT_b, a key frame is detected at point P5 and appropriately recorded at block 648; otherwise, it proceeds to block 638 where Σ_F is incremented by S before continuing to process the remaining frames in the video sequence. However, if, at block 620, Trans is true, that is, it is in a potential transition phase, and, at block 622, D_i is less than the transition break threshold, T_t, a further distinction is achieved by calculating D_a, the difference between the current feature, f_i, and the previous feature, f_p, at block 624. At block 626, if D_a is greater than βD_i (where β ≥ 2) or the average of D_i between the frames in the potential transition, Σ_ta/Σ_tF, is greater than γT_t (where γ ≥ 1), then a transition is confirmed at point P3. A shot starting at frame F_s and ending at frame F_e (= i - Σ_tF) is recorded at block 628 before reinitialization of the relevant parameters at block 630. However, if, at block 626, the conditions are not true and, at block 632, the failed frame count, Σ_tm, is not less than the maximum allowable number of fails in a transition, N_tmmax, then a false transition is declared at point P4 and accordingly Trans is set to false before proceeding to block 640. At block 632, however, if the condition is satisfied, it proceeds to blocks 636 and 638 where the parameters are adjusted accordingly before it continues to process the next frames within the video sequence. At block 622, if the condition is not met, it proceeds to block 621 where the parameters concerned are adjusted accordingly before proceeding to process the following frames in the video sequence. The output data contains the frame numbers and/or time codes of the starting and ending frames as well as the key frames of each shot.

While the present invention has been described particularly with reference to FIGS. 1 to 5, it should be understood that the figures are for illustration only and should not be taken as limitations on the invention. In addition, it is clear that the methods of the present invention have utility in many applications where analysis of image information is required. It is contemplated that many changes and modifications may be made by one of ordinary skill in the art without departing from the spirit and the scope of the invention as described.
Claims

1. In a system for parsing a plurality of images in motion without modifying the media in which the images are recorded originally, said images being further divided into a plurality of sequences of frames, a method for selecting at least one key frame representative of a sequence of said images comprising the steps of:

(a) determining a difference metric or a set of difference metrics between consecutive image frames, said difference metrics having corresponding thresholds;

(b) deriving a content difference (D_i), said D_i being the difference between two current image frames based on said difference metrics, the interval between said two current image frames being adjustable with a skip factor S which defines the resolution at which said image frames are being analysed;

(c) accumulating D_i between every two said consecutive frames until the sum thereof exceeds a predetermined potential key threshold T_k;

(d) calculating a difference D_a, said D_a being the difference between the current frame and the previous key frame based on said difference metrics, or between the current frame and the first frame of said sequence based also on said difference metrics if there is no previous key frame, the current frame becoming the key frame if D_a exceeds a predetermined key frame threshold T_d; and

(e) continuing the steps in (a) to (d) until the end frame is reached, whereby key frames for indexing sequences of images are identified and captured automatically.

2. In a system for parsing a plurality of images in motion without modifying the media in which the images are recorded originally, said images being further divided into a plurality of sequences of frames, a method for segmenting at least one sequence of said images into individual camera shots, said method comprising the steps of:

(a) determining a difference metric or a set of difference metrics between consecutive image frames, said difference metrics having corresponding shot break thresholds T_b;

(b) deriving a content difference (D_i), said D_i being the difference between two current image frames based on said difference metrics, the interval between said two current image frames being adjustable with a skip factor S which defines the resolution at which said image frames are being analysed;

(c) declaring a sharp cut if D_i exceeds said threshold T_b;

(d) detecting the starting frame of a potential transition if said D_i exceeds a transition threshold T_t but is less than said shot break threshold T_b;

(e) detecting the ending frame of a potential transition by verifying the accumulated difference, said accumulated difference being based on said selected difference metrics; and

(f) continuing the steps in (a) to (e) until the end frame is reached, whereby sequences of images having individual camera shots are identified and segmented automatically in at least one pass.

50 

3. The method of video segmentation as in claim 2 wherein the processing speed of steps 4(a)-4(f) is enhanced with a multi-pass method, said multi-pass method comprising at least two steps:

(a) in a first pass, resolution is temporarily decreased by choosing a substantially larger skip factor S and a lower shot break threshold T_b so as to identify rapidly the locations of potential segment boundaries without allowing any real boundaries to pass through without being detected; and

(b) in subsequent passes, resolution is increased and all computation is restricted to the vicinity of said potential segment boundaries whereby both camera breaks and gradual transitions are further identified.

4. The method for video segmentation as in claim 3 wherein said multi-pass method can employ different difference metrics in different passes to increase confidence in the results.

5. The method for video segmentation as in claim 3 wherein said first pass does not apply steps 4(a)-4(f).

6. The method for video segmentation as in claim 3 wherein said subsequent passes apply said steps 4(a)-4(f).

7. In a system for parsing a plurality of images in motion without modifying the media in which the images are recorded originally, said images being further divided into a plurality of sequences of frames, a method for segmenting at least one sequence of said images into individual camera shots and selecting at least one key frame representative of a sequence of said images, said method comprising the steps of:

(a) determining a difference metric or a set of difference metrics between consecutive image frames, said difference metrics having corresponding shot break thresholds T_b;

(b) deriving a content difference (D_i), said D_i being the difference between two current image frames based on said difference metrics, the interval between said two current image frames being adjustable with a skip factor S which defines the resolution at which said image frames are being analysed;

(c) declaring a sharp cut if D_i exceeds said threshold T_b;

(d) detecting the starting frame of a potential transition if said D_i exceeds a transition threshold T_t but is less than said shot break threshold T_b;

(e) detecting the ending frame of a potential transition by verifying the accumulated difference, said accumulated difference being based on said selected difference metrics;

(f) continuing the steps in (a) to (e) until the end frame is reached;

(g) deriving a content difference (D_i), said D_i being the difference between two current image frames based on said selected image features and said difference metric, the interval between said two current image frames being adjustable with a skip factor S which defines the resolution at which said image frames are being analysed;

(h) accumulating D_i between every two said consecutive frames until the sum thereof exceeds a predetermined potential key threshold T_k;

(i) calculating a difference D_a, said D_a being the difference between the current frame and the previous key frame based on said difference metric, or between the current frame and the first frame of said sequence based also on said difference metric if there is no previous key frame, the current frame becoming the key frame if D_a exceeds a predetermined key frame threshold T_d; and

(j) continuing the steps in (a) to (i) until the end frame is reached, whereby sequences of images having individual camera shots are identified and segmented automatically and key frames for indexing sequences of images are identified and captured in at least one pass.

8. The method for video segmentation as in claims 2 and 7 wherein said shot break threshold T_b comprises a sum of the mean of the frame-to-frame difference μ and a multiple a of the standard deviation of the frame-to-frame difference σ.

9. The method for video segmentation as in claim 8 wherein said multiple a has a value between 5 and 6 when the difference metric is a histogram comparison.

10. The method for video segmentation as in claims 2 and 7 wherein said transition threshold T_t comprises a sum of a multiple b of said shot break threshold T_b.